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1.   INTRODUCTION 

With  the  introduction  of  LSI  (large  scale  integration)  techniques 
into  the  computer  world,  the  cost  of  computer  hardware  is  dropping  sharply. 
Computer  designers  can  now  enjoy  substantial  performance  improvement  through 
a  multiprocessing  system  without  having  to  pay  the  price  of  a  much  more  expen- 
sive system.  Using  LSI,  the  cost  of  a  system  is  no  longer  in  terms  of  the 
number  of  gates  in  the  system.  A  much  larger  number  of  gates  can  be  put  in  an 
LSI  chip;  actually  the  number  of  pins  available  is  the  main  limitation.  More- 
over, the  cost  of  n  similar  processors  wired  in  a  multiprocessor  is  considerably 
less  than  n  times  the  cost  of  each  individual  processor  wired  in  a  uniprocessor. 
So  the  cost  does  not  necessarily  increase  linearly  with  the  processing  power. 

Our  motivation  for  a  big  fast  machine  comes  from  linear  algebra  compu- 
tation.  In  nearly  all  aspects  of  engineering,  physics  and  statistics,  solutions 
of  large  systems  of  linear  equations  are  fundamental.  Although  any  general 
purpose  machine  can  do  any  programmable  problem  if  given  enough  time,  there 
exist  some  linear  system  problems  that  are  so  big  that  they  can  never  get 
fully  debugged  on  a  slow  computer  due  to  the  extremely  lengthy  time  to  get 
each  run.   The  introduction  of  faster  machines  may  hopefully  lead  to  solutions 
of  such  problems. 

The  higher  cost-effectiveness  due  to  LSI  technologies  and  the  problem 
necessities  justify  the  emergence  of  big  array  machines  like  ILLIAC  IV,  STAR, 
and  TI's  ASC  machine.   One  main  area  of  interest  in  the  development  of  a  parallel 
processing  machine  is  the  compiler  techniques  such  as  parallelism  detection  and 


scheduling.   This  area  is  currently  being  tackled  by  a  number  of  independent 
researchers  [1]. 

Another  area  of  interest  is  the  organization  problem.   This  is  the 
problem  of  organizing  components  into  the  system  to  give  the  best  efficiency 
in  doing  a  given  set  of  computations.   It  should  be  noted  that  different 
computation  and  data  structures  could  induce  different  machine  organization 
parameters.   Since  solutions  of  linear  equation  systems  are  among  the  most 
heavily  used  computations,  it  would  be  desirable  if  we  can  discover  some 
parameters  of  machine  architecture  by  carefully  analyzing  some  existing  linear 
algebra  subroutines. 

This  thesis  analyzes  in  depth  a  set  of  matrix  eigenvalue  solution 
programs  named  the  EISPACK  routines.  The  first  section  will  look  into  some 
Fortran  Analyser  results  and  will  be  dealing  with  the  following  problems: 

1.  What  is  the  average  number  of  processors  required  in  the  machine 
to  achieve  the  best  speedup? 

2.  What  is  the  upper  bound  of  the  average  speedup  we  will  be  get- 
ting with  such  a  machine? 

3.  Is  this  machine  efficient  enough,  i.e.,  what  percent  of  the  time 
is  the  machine  doing  useful  work? 

k.  How  would  the  speedup  and  efficiency  of  our  machine  vary  when 
the  number  of  processors  is  reduced? 

The  second  section  presents  some  hand  analyzed  results  that 
attempt  to  tackle  the  following  problems : 


1.  How  should  the  data  be  accessed  in  the  system? 

2.  How  should  the  data  he  routed  between  the  memory  system  and  the 
processor  system? 

3.  What  should  the  instruction  set  of  this  machine  contain  to  most 
efficiently  process  the  given  set  of  computations? 

In  the  conclusion,  an  attempt  will  be  made  to  lay  down  some  of  the 
parameters  of  machine  organization  inferred  by  the  analysis  of  the  EISPACK 
programs . 


2.   ANALYZER  EESULT 


A.   Definitions 

Some  definitions  have  to  be  spelled  out  before  the  analysis  results 

are  presented.  We  define  T  as  the  time  required  to  perform  some  computation 

Tl 
using  p  processing  elements.   Then  the  speedup  S   is  defined  as  r^5,  .  We 

Tx  "        P  P 

define  efficiency  E   as  — —  .   Let  0   be  the  number  of  operations  executed 

P      pTp         p 

in  performing  the  same  computation  using  p  processing  elements.   Then  we 

0  0 

define  operation  redundancy  as  -~—   .   Utilization  is  defined  as  U_  =  ^~ 

P 


T±  p   pT, 


We  define  first  order  of  x,  0-  (x)  as  x+a  and  the  second  order 
of  x,  0p(x)  as  bx+c,  where  a,b,c  are  constants.  We  also  define  the  speed- 


T 
up  type  ratio  a   -  -- 


logg^ 


B.    Model 

EISPACK  programs  written  in  Fortran  are  analyzed  with  the  help  of 
the  Fortran  Analyzer,  implemented  by  Kuck  and  his  group.   In  the  Fortran 
Analyzer,  a  number  of  assumptions  are  made  about  the  machine  organization. 
They  are  discussed  in  [2]  and  can  be  summarized  as  follows: 

1)  I/O  operations  are  ignored. 

2)  Control  Unit  timing  is  also  ignored,  assuming  that  instructions 
are  always  available  for  execution  and  are  never  held  up  by  a  control  unit. 

3)  All  processors  are  capable  of  executing  any  of  the  four  arith- 
metic operations  in  the  same  amount  of  time,  called  unit  time. 

k)     All  supplied  Fortran  functions  are  evaluated  using  a  fast 
scheme  proposed  by  DeLugish  [3],  which  will  allow  each  function  to  be  evalu- 
ated in  no  more  than  a  few  multiply  times. 


5)  A  many-way  jump  processor  as  proposed  by  Davis  [k]   is  assumed. 
This  processor  can  determine  in  unit  time  which  program  statement  is  the 
successor  to  the  statement  at  the  top  of  a  tree  of  IF  statements. 

6)  Memory  cycle  time  is  assumed  to  be  unit  time,  and  it  is  further 
assumed  that  there  are  no  memory  accessing  conflicts. 

7)  We  also  assume  the  existence  of  an  instantaneously  operating 
alignment  network  which  can  transmit  data  from  memory  to  memory,  from  pro- 
cessor to  processor  and  between  memories  and  processors. 

Since  no  analyzer  can  have  unlimited  capabilities,  certain  limita- 
tions are  imposed  on  the  EISPACK  programs  to  be  analyzed  [5]: 

1)  No  more  than  ^0  statements  are  allowed  in  a  DO  loop. 

2)  Up  to  7  subscripts  are  allowed  for  each  variable. 

3)  Up  to  10  variables  are  allowed  on  the  right-hand  side  of  an 
assignment  statement  in  a  DO  loop  when  subscripts  and  function  arguments  are 
not  counted. 

k)  No  more  than  100  blocks  or  roughly  100  assignment  statements 
are  allowed. 

5)  No  more  than  50  label  uses  and  50  label  definitions  are  allowed. 

6)  Logical  variables  and  logical  assignment  statements  are  not 
allowed. 

Hence  some  large  EISPACK  programs  have  to  be  broken  down  into  smaller 
modules  to  fit  into  the  above  limitations.   Large  loops  are  analyzed  sepa- 
rately with  the  outermost  loops  discarded;  we  call  these  'stripped*  loops. 
This  presents  the  danger  of  neglecting  the  parallelism  due  to  the  outermost 


loops.  However,  looking  at  the  programs  more  carefully  reveals  that  only  one 
of  the  15  big  loops  presented  shows  a  possibility  of  parallelism  in  the  outer- 
most DO-loop.   The  rest  of  the  loops  have  one  sort  or  another  of  recurrence 
relations  embedded  in  them  which  cannot  be  handled  efficiently  by  ordinary  algo- 
rithms that  deal  with  recurrence  relations.   Hence,  imposing  the  restriction 
that  large  loops  are  to  be  computed  serially  will  not  hamper  the  analyzer  results 
by  a  great  amount. 

The  analysis  techniques,  which  can  be  treated  as  compiler  models  for 
the  given  machine,  are  described  fully  by  Kuck,  Muraoka  and  Chen  [6].   In  short, 
the  Analyzer  breaks  a  Fortran  program  into  blocks  of  assignment  statements  (BAS)., 
DO  loop  blocks;  and  IF  tree  blocks.   During  analysis,  T^  T  ,  p,  and  0  for 
each  block  is  computed  separately.   Then  using  IF  and  GOTO  statements,  all  the 
traces  through  the  program  are  found  and  the  block  T  ,  T   and  0   are 
accumulated  for  each  trace  to  give  T^  T   and  0   of  the  trace.   The  maximum  , 
p  found  in  any  block  in  each  trace  become  p.  R  ,  E  ,  S  ,  and  U  are  calculated 
for  each  trace.   In  each  BAS,  forward  substitution  is  made  whenever  possible. 
Then  using  tree-height  reduction  techniques,  the  best  speedup  result  for  that 
BAS  will  be  obtained.   For  DO  loop  blocks,  horizontal  scheme  exploits 
Type  1  parallelism  in  the  DO  loop  while  the  vertical  scheme  extracts  Type  0 
parallelism.   In  the  analysis  of  EISPACK  programs,  both  schemes  will  be  used. 

C.        A  Description  of  EISPACK 

Most  of  the  original  EISPACK  routines  are  written  in  ALGOL  and  can 
be  found  in  [7].   The  Fortran  EISPACK  programs  were  translated  from  the  corre- 
sponding ALGOL  procedures  by  B.  S.  Garbow  and  J.  M.  Boyle  of  Argonne  National 


Laboratory.   It  is  a  set  of  ~$h   subroutines.  According  to  the  type  of  the 
matrix  to  be  solved  and  how  many  eigenvalues  or  eigenvectors  the  user  needs, 
different  sequences  of  calls  to  these  subroutines  can  be  made. 

For  example,  given  a  real  nonsymmetric  matrix,  if  we  went  to  find 
all  the  eigenvalues  and  eigenvectors,  the  following  sequence  of  EISPACK 
routine  calls  will  be  needed: 

1.  Call  BALANC 

2.  Call  ELMHES 

3.  Call  ELTRAN 
k.  Call  HQR2 

5.   Call  BALBAK. 

All  the  EISPACK  sequences  are   fully  detailed  in  Boyle's  path 
chart    [8]. 

D.     Results 

Table  1  shows  the  analysis  results  using  as  many  processors 
as  required.   There  are  a  total  of  kk   program  segments  (8  original  subroutines 
had  to  be  broken  down  into  two  or  more  program  segments).   Program  segments 
that  are  'stripped'  DO  loops  are  marked  with  'Loop'  next  to  the  program 
name.   There  are  15  such  segments.   In  INVIT  and  TINVIT,  the  subroutines  are 
essentially  big  DO  loops  which  even  when  'stripped'  are  too  big  to  be 
handled  by  the  Analyzer  and  have  to  be  broken  further  in  3  and  2  segments, 
respectively.   In  these  results,  all  matrices  are  assumed  to  be  10x10  and 
all  vectors  are  of  dimension  10.  Since  the  results  for  both  the  hori- 
zontal and  vertical  schemes  are  available,  we  pick  the  best  speedup  re- 
sult out  of  the  two  and  put  it  in  Table  2.   The  choice  of  the  scheme  for 
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each  subroutine  is  shown  in  Column  2  of  the  table.   The  number  of  pos- 
sible traces  in  each  subroutine  is  shown  in  Column  11.   The  remaining 
columns  have  heading  abbreviations  that  are  explained  in  Section  2.  A. 
The  entries  are  the  average  values  for  different  paths  of  each  sub- 
routine. One  exception  is  that  the  a  values  in  Column  12  are  computed 
from  Columns  3  and  5  of  each  program  segment. 

E.   Observations  and  Interpretations 

1.  Twenty-seven  out  of  hM-   programs  produce  better  speedup  results 
when  the  Analyzer  employs  the  horizontal  cut  scheme,  compared  to  only  13 
when  the  vertical  cut  scheme  is  used  (the  remaining  four  produce  similar 
results  for  both  horizontal  and  vertical  cuts).   Since  most  of  the  EISPACK 
programs  deal  with  vectors  that  show  Type  1  parallelism,  this  observation  is 
consistent  with  the  discussion  in  Section  2.B. 

2.  The  average  number  of  processors  used  in  the  programs  analyzed  i 
about  99.  It  ranges  from  as  low  as  10  to  as  high  as  300.  The  average  speedup 
obtained  is  around  20.  This  is  a  good  speedup  for  the  dimension  of  the  arrays 
used  is  10.  We  can  also  observe  that  the  average  efficiency  is  27$.  Actually 
most  of  the  bad  results  come  from  the  bigger  segments  (T  >  5000).  If  we  con- 
sider only  the  program  segments  with  T  <  5000  (there  are  h-0  such  segments), 
the  average  number  of  processors  used  is  90,  average  speedup  is  21,  and  averag 
efficiency  is  29$.  A  general  rule  of  multiprocessing  is  that  an  efficiency 
of  more  than  20$  is  acceptable.  Hence,  we  can  claim  that  the  EISPACK  programs 

are  suitable  for  parallel  processing. 

T 

3.  From  the  a   (=  y^ — m— )   column,  we  can  see  that  there  are  two 

programs  where  a's  are  less  than  1.   There  are  31  program  segments  whose  a's 
are  between  1  and  10  and  they  are  called  the  Type  1  programs.   The  remaining  ] 


16 

have  a  values  higher  than  10  and  will  "be  known  as  Type  2  programs. 

For  further  investigation,  a  few  programs  are  chosen  and  the  DO 
loop  limit  is  increased  to  20,  30,  and  kO.      The  result  is  shown  in  Table  3- 
From  the  table  we  can  see  that  the  two  programs  with  a's  less  than  1  have 
relatively  constant  T  's.   These  correspond  to  the  Type  0  definition  by  Kuck  [9]. 
A  multiprocessing  system  can  compute  a  Type  0  program  in  the  same  amount  of 

time  regardless  of  the  number  of  iterations  required  to  be  done.   The  speedup 

T 
(=  ^  )  will  then  be  of  the  order  of  T^  0  (T  ). 

The  a  values  for  Type  1  programs  are  relatively  constant  over  the 

different  DO  loop  limits.   So  the  minimum  time  to  compute  a  Type  1  program 

T   is  0  (log  T  ).   By  forward  substituting  each  BAS  (Block  of  Assignment 

Statements)  to  its  full  extent,  and  carrying  out  normal  tree-height  reduction 
techniques,  we  can  see  that  most  program  segments  can  be  computed  in  a  time  on 

the  order  of  logpT  ,  0  (log  T  ).   This  is  in  full  accordance  with  the  result  we 
have  found  in  Table  3«   The  same  definition  for  Type  1  is  also  given  by  Kuck  [9], 
For  Type  2  programs,  the  a  values  no  longer  are  constant  over  different 
DO  loop  limits.   However,  we  can  see  that  the  speedups  of  these  programs  are 
closely  related  to  log  T, .   The  consistency  of  this  result  is  shown  even  more 
clearly  in  Figure  1(a)  and  Figure  1(b).   Figure  l(a)  plots  the  11  Type  2  program 
segments.  Figure  l(b)  plots  the  effect  on  speedups  by  the  variation  of  DO 
Loop  limits  for  ORTHES  and  TRED1.   It  is  readily  observed  that  S  =  k^Log  T  + 

k  ,  where  k  and  k  are  constants,  or  S  =  0  (log  T  ).  However,  until  now, 

there  are  still  no  apparent  reasons  for  programs  to  behave  like  this.  The 

Type  2  definition  by  Kuck  [9]  is  different  and  offers  a  plausible  explanation 

T 
of  program  behavior.   However,  his  prediction  that  *■ ~  values  for  Type  2 

programs  will  be  roughly  constant  does  not  fit  well  with  the  EISPACK  programs. 
k.     When  we  are  dealing  with  actual  machine  design,  however,  we 
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Figure  l(a).      Speedup  Graph  for  Type  2   Programs 
(DO  Loop   Limit   =   10 ) 
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Figure  l(b).      Speedup  Graphs   for  ORIHES  and  TRED1 
(DO  Loop  Limit  Varying) 
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cannot  assume  the  machine  to  possess  arbitrarily  many  processors.  Hence 
we  have  to  limit  ourselves  to  a  machine  with  a  finite  number  of  processors. 
Since  the  previous  results  show  that  the  average  number  of  processors  used 
is  around  99>  we  hypothesize  a  machine  with  100  processors.  With  fewer 
processors,  we  can  expect  the  speedup  to  degrade  a  little  bit.   Nevertheless, 
the  efficiency  will  go  up  as  the  number  of  processors  decreases.  At  some 
point  in  between,  we  should  be  able  to  find  that  both  the  speedup  and  effi- 
ciency are  reasonably  high.   So  the  next  step  is  to  check  how  much  the 
results  will  change  if  we  fix  the  number  of  processors  in  our  hypothetical 
machine  to  100. 

For  each  test  program,  the  analyzer  is  capable  of  breaking  the 
jobstep  requiring  the  maximum  number  of  processors  into  two  jobsteps  each 
requiring  half  the  number  of  processors  as  before  (see  Figure  2).   Then 
the  analyzer  will  try  the  same  breakup  procedure  on  the  resulting  jobsteps. 
This  breakup  procedure  will  continue  until  the  machine  is  more  than  50$ 
utilized.   For  each  program,  the  analyzer  provides  the  result  for  each 
breakup  operation  on  each  independent  path  through  the  program.  Assuming 
speedup  and  efficiency  are  equally  important  in  actual  machine  design,  we 
use  the  product  (speedup  x  efficiency)  as  the  choice  criterion.   So,  for 
each  path,  we  choose  the  breakup  result  with  the  greatest  (speedup  x  effi- 
ciency) value.  However,  the  efficiency  value  is  calculated  with  regard  to 
the  number  of  processors  used  for  that  particular  path,  not  the  number  of 
processors  available  in  the  machine.   So  the  efficiency  for  programs  using 
a  small  number  of  processors  is  actually  smaller  than  that  shown  in  the 
result.   In  order  to  compensate  this  fact,  we  add  an  extra  criterion  that 


The  only  reason  why  this  efficiency  enters  the  choice  criterion  is  that 
since  we  still  do  not  know  the  optimal  number  of  processors  in  a  machine 
(100  is  only  a  crude  suggestion  offered  by  the  previous  results).   The 
choice  criterion  should  allow  us  to  find  a  better  possibility  for  the 
number  of  processors. 
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(a).      Before  Breakup  Procedure  is  Applied 
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(b).  After  One  Application  of  the  Breakup  Procedure 
Figure  2.   Breakup  Procedure 
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if  the  number  of  processors  used  is  less  than  50,  the  breakup  result  with 
the  highest  speedup  is  chosen.   In  short, 

speedup  x  efficiency    50  <  p  <  100 


Choice  criterion  = 


speedup  p  <  50 


Then  the  averages  of  the  chosen  results  of  all  the  paths  in  a  pro- 
gram are  computed  and  tabulated  as  in  Table  k.     V  and  H  in  Column  1  corre- 
spond to  the  scheme  being  chosen — vertical  cut  scheme  or  horizontal  cut 
scheme.   Subscript  of  <»  implies  that  the  results  are  equivalent  to  that  of 
the  infinite  processor  configuration.   Subscript  of  M  implies  that  breakup 
procedure  has  been  applied  and  the  results  are  some  intermediate  values. 

As  can  be  seen,  the  new  average  number  of  processors  used  is  5>k,    a 
saving  of  nearly  50$  of  that  used  in  the  infinite  processor  configuration. 
The  speedup  decreases  from  20.3  to  l6A,  a  degradation  of  only  20$.   This 
speedup  degradation  is  compensated  by  a  corresponding  20$  increase  in  efficiency 
So  we  can  conclude  that  using  a  limited  processor  configuration  does  not  degrade 
the  performance  of  the  multiprocessing  system.   Furthermore,  from  both  the  speec 
up  and  efficiency  standpoints,  we  can  also  deduce  that  probably  the  optimal  num- 
ber of  processors  in  a  multiprocessing  system  aimed  at  computing  linear  algebra 
should  be  between  55  and  100.  However,  it  should  be  noted  that  all  matrices  are 
assumed  to  be  10  by  10,  and  in  reality  most  of  the  EISPACK  usages  are  dealing 
with  much  larger  matrices.   The  number  of  processors  required  will  then  be 
much  more  than  what  is  shown  here.  However,  due  to  the  limited  capabilities 
of  the  Fortran  Analyzer,  we  cannot  provide  similar  results  for  matrices  with 
higher  dimensions.  From  the  incomplete  results  we  have  in  Table  3,  the  average 
number  of  processors  used  for  the  six  programs  shown  are  82,  317,  706  and 
12^9  when  the  dimensions  of  the  matrices  are  assumed  to  be  10,  20,  30  and 
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ho,    respectively.  Hence,  we  might  extrapolate  that  the  optimal  number  of 

V 


2     2 
processors  used  is  of  the  order  of  n  (0  (n  )),  where  n  is  the  average  di- 


mension of  the  matrices  used. 

F.   Typical  Path  Results 

As  pointed  out  in  Section  2.C,   whenever  a  user  wants  to  use  the 
EISPACK  programs  to  calculate  eigenvalues  or  eigenvectors,  he  usually  uses 
a  series  of  EISPACK  routine  calls.   Depending  on  what  kind  of  matrix  he  has 
and  whether  he  wants  all  eigenvalues  and  eigenvectors  or  only  some  of  the 
eigenvectors  or  even  only  eigenvalues,  he  will  be  using  different  paths  of 
subroutine  calls.   Tables  2  and  k   only  give  speedup  results  for  individual 
subroutines.   To  find  out  how  much  speedup  will  result  when  one  particular 
path  of  subroutine  calls  is  run  on  a  multiprocessing  system,  we  have  to 
compute  the  speedup  for  the  path  from  the  existing  results.   The  serial  time 
(T  )  and  parallel  time  (T  )  for  each  program  segment  used  in  the  path  are 

added  up  to  form  Path  T  and  Path  T  .   The  program  segment  times  have  to  be 

weighted  10  times  if  the  program  segment  is  marked  'Loop',  i.e.,  the  segment 
is  only  a  single  iteration  of  a  big  loop  in  the  subroutine.   The  average 
speedup  for  the  path, 

Path  T 

Path  S  =  .,_,,.  m   • 
p   Path  T 
P 

If  our  multiprocessing  system  is  also  a  multiprogramming  system,  then  we  can 
define  average  number  of  processors  needed  for  each  path,  P(ave),  as  the 
average  of  those  of  individual  segments.   If  it  is  not  a  multiprogramming 
system,  then  we  have  to  define  the  number  of  processors  needed  as  the  maximum 
of  those  of  individual  segments. 
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The  following  results  are  computed  for  the  most  commonly  used  paths 
and  are  in  the  order  of  their  popularities.  Eigenvalues  and  eigenvectors  for 
real  symmetric  matrices  are  most  commonly  needed  for  most  users.  Results  for 
both  the  infinite  processor  and  limited  processor  configurations  are  available. 

1 .   Real  Symmetric  (all  Eigenvalues  and  Vectors) 

Program  Segments  Used:   TRED2A,   TRED2B,   IMTQ2A,   IMTQ2B 

Infinite:   T  =  5274  +  657  x  10  +  479  +   929  x  10 

=  21613 

T  =817+  106  x  10  +  16  +  25  x  10 
P 

=  2083 

Tl 

Tf-   =  10.4 

P 
P(ave)  =  (135  +  4l  +  200  +  26  )A  =  100. 5 

Limited:    T  =  2l6l3 

T  =  22  +  25  x  10  +  927  +  106  x  10 
P 

=  2259 

Tl 

P 
P(ave)  =  (90  +  41  +   42  +  26)  A 

=  50 

2.   Hermitian  (all  Eigenvalues  and  Vectors) 

Program  Segments  Used:  HTRID1,   IMTQ2A,   IMTQ2B,   HTRIBK 
Infinite:   T  =  1832  x  10  +  U.79  +  929  x  10  +  IOO67  =  38156 

T  =  139  x  10  +  16  +  25  x  10  +  672  =  2328 

Tl 

5f  =  16A 

p 

P(ave)  =  (200  +  63  +  135  +  4l)A 
=  110 
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Limited:  T     =  38156 

T     -  1203  +  160  x  10  +  22  +  25  x  10  =  3065 
P 

Tl 

^=  12.5 

P 
P(ave)  =    (90  +  ifl  +  J+0  +  50 )  A 

=   55-3 

3.   Real  Non-Symmetric  (Eigenvalues  Only) 

Program  Segments  Used:  BAIAWC,  ELMHES,  HQPA,  HQRB 
Infinite:   T  =  991  +  20&7  +  30^  +  28l  x  10 

-  6192 

T  =  kk  +   110  +  30  +  320  =  50k 
P 

Tl 

p 

P(ave)  =  (115  +  Ikk  +   95  +  86) /k  =   110 

Limited:    T  =  50  +  218  +  kk  +   380  =  692 
P 

Tl 

Tj-  =   9.0 
P 

P(ave)  =  (93  +  80  +  26  +  kk)/k  =   60.8 

We  can  see  that  the  average  speedups  are  around  13  for  infinite 
processor  configuration  and  the  average  number  of  processors  used  is  around 
100.  Using  limited  number  of  processors,  we  have  an  average  speedup  of  10, 
while  using  about  55  processors  on  the  average. 
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3.   MACHINE  ORGANIZATION 

This  section  attempts  to  give  more  understanding  about  how 
actual  user  programs  look  in  detail  and  hopefully  deduces  some  guide- 
lines on  how  machines  should  be  built  in  order  to  give  better  user  pro- 
gramming results.   Since  EISPACK  programs  operate  on  matrices  and  vectors, 
the  following  results  are  most  relevant  for  a  machine  built  for  vector 
operation  purposes.  However,  the  deduction  methods  used  here  are  quite 
general.   So  machines  aiming  at  computing  programs  that  use  different 
data  and  program  structures  may  profitably  use  similar  analysis  techniques. 
Using  this  direction  of  attack,  we  can  hope  to  produce  a  better  under- 
standing of  what  a  high  level  language  should  contain. 

All  the  results  here  are  the  results  of  hand  analysis  of  the 
EISPACK  routines. 

A.    System  Model 

In  order  to  extract  more  information  out  of  a  set  of  test  pro- 
grams, we  first  have  to  specify  in  general  the  kind  of  multiprocessing 
system  we  are  dealing  with.  However,  since  most  of  the  specifications 
will  be  dependent  on  the  results  we  obtain  or  results  that  are  yet  to  be 
uncovered,  the  initial  specifications  outlined  here  should  be  as  general 
and  flexible  as  possible. 

We  will  be  dealing  with  a  SIMD  (Single  Instruction  Multiple  Data) 
type  of  multiprocessing  system,  an  example  of  which  is  the  ILLIAC  TV 
machine.   In  general  it  looks  like: 
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Figure  3«   Simple  Multiprocessing  System 


The  problem  to  "be  tackled  is  the  program  execution  problem.   This  includes 
three  main  areas:   data  access,  data  alignment,  and  data  processing  [9]. 

Most  high-speed  computers  have  several  random  access  memory  units 
operating  in  parallel  in  order  to  make  the  memory  bandwidth  match  the  high 
processing  speed.  However,  multiple  memories  will  create  problems  in  the 
assigning  of  data  to  different  memory  banks  without  causing  any  conflicts 
in  accessing o  Assume  we  have  m  memories  and  k  operands  to  be  accessed  at 

some  point  of  computation.   For  vector  and  array  operations,  the  operands 

-x- 
to  be  accessed  may  be  some  rows,  columns,  diagonals  or  square  partitions 

of  an  array.   If  m  <  k,  we  know  for  sure  that  we  cannot  access  all  n 

operands  in  one  memory  cycle.  When  m  =  k,  if  m  is  prime  we  can  access 

rows,  columns  or  diagonals  in  one  memory  cycle  using  a  simple  1-skew 

scheme  (Figure  k(&)).      However,  if  m  is  even,  using  1-skew  scheme 

(Figure  4(b)),  we  can  access  rows  or  columns  in  one  memory  cycle, 


All  partitions  parallel  to  the  main  diagonal  are  also  known  as  diagonals. 
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but  we  need  two  cycles  to  access  the  diagonals.   The  penalty  is  not  great, 
especially  for  programs  which  access  diagonals  sparingly.   By  using  redun- 
dant memory  units,  i.e.,  m  >  k,  it  is  possible  to  find  storage  schemes  that 
allow  conflict-free  access  of  rows,  columns  or  diagonals.   Some  of  these 
schemes  are  discussed  extensively  by  Budnik  and  Kuck  [10],  and  Lawrie  [11]. 
However,  they  all  have  drawbacks  of  one  sort  or  other.   For  example,  the 
m  =  k+1  scheme  presents  the  difficulty  in  indexing  the  memories  and  the 
m  =  2k  scheme  uses  only  one  half  of  the  available  bandwidth.   Hence  the 
choice  of  memory  storage  scheme  is  still  an  open  question. 

After  operands  are  successfully  fetched  out  of  the  memories,  they 
still  have  to  be  routed  to  corresponding  processors  in  minimum  delay  time. 
The  same  is  true  for  results  coming  out  of  the  processors  which  are  to  be 
stored  in  the  memories.  A  crossbar  switch  is  the  simplest  answer.   The  total 
network  delay  (21og  m)  is  the  smallest  known.  However,  the  number  of  gates 
required  is  enormous  (3m  for  each  bit  of  data)  especially  when  the  number 
of  processors  or  memories  is  big.   Benes  [12]  and  Batcher  [13]  designed 
relatively  simpler  alignment  networks.   One  drawback  of  Batcher's  networks 
is  the  increased  data  delay,  while  the  Benes  network  does  not  have  a 
fast  control  algorithm.   Lawrie  [11]  designed  an  Omega  Network  which  has 
a  gate  delay  of  Ulog  m  and  a  total  number  of  (3mlog  m(d  +  l/2(log  m-l))) 
gates,  where  d  is  the  number  of  bits  to  be  passed.  Although  it  cannot 
perform  some  permutations,  it  can  effect  most  of  the  common  connections 
required.   One  example  is  the  broadcasting  of  one  argument  to  a  certain 
group  of  processors.   It  can  be  shown  that,  to  transmit  a  k8   bit  word 
in  200  nsec  through  a  m  by  m  network,  the  number  of  gates  required  by 
a  Crossbar  Switch  will  be  more  than  that  of  an  Omega  Network  whenever 
m  exceeds  8.   However,  it  should  be  noted  that  Crossbar  Switch  can  be 
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made  from  off-the-shelf  integrated  circuits  (multiplexors)  while  the  Omega 
Network  elements  need  new  layouts.  Hence  the  breakeven  point  may  be  higher 
than  50. 

As  for  the  processor  system,  there  seem  to  be  two  choices.   One 
is  the  pure  vector  machine  concept,  in  which  at  any  one  time  each  pro- 
cessor is  either  doing  the  same  operation  as  others  or  is  idle.   This 
is  fairly  easy  to  implement,  assuming  some  control  vector  schemes  [Ik] 
are  being  used.   The  other  choice  is  to  allow  each  processor  to  perform 
different  operations.   To  compute  the  following  FORTRAN  segment: 

DO  10  I  =  1,4 
10  A(I)  =  B(l)  +  C(I)  +  D(l)  *  E(l) 

if  we  have  eight  processors,  using  the  first  scheme,  we  can  do  it  in 
3  unit  times;  however,  using  the  second  choice,  we  need  only  2  unit  times. 
So  the  second  scheme  will  produce  better  speedup.  Moreover,  in  the  first 
case,  half  of  the  processors  will  be  idle  all  the  time.   However,  to  imple- 
ment the  second  choice,  we  would  require  the  control  vector  to  contain 
operation  information  for  each  processor.   More  time  would  then  be  needed 
for  setting  up  the  control  vectors  and  more  space  will  also  be  needed  to 
store  the  vectors.  Hence,  if  a  cheaper  organization  is  required,  the  pure 
vector  machine  concept  will  be  sufficient.   If  the  best  speedup  is  needed, 
the  processors  will  have  to  be  able  to  do  different  operations. 

B.   Types  of  Operations 

Array  operations  can  be  subdivided  into  whole  matrix  operations 
and  one  dimensional  vector  operations.  Matrix  operations  are  those  which 
have  a  matrix  as  an  operand  or  a  result.   The  speedup  should  be  of  the  order 
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of  the  square  of  the  dimension  of  the  matrix.  However,  this  is  only  possi- 
ble if  there  are  as  many  processors  as  the  square  of  the  dimension  of  the 
matrix.   For  a  machine  with  less  processors,  the  operation  has  to  be  done 
in  smaller  partitions,  (e.g.,  square  partitions,  rows,  columns  or  diagonals). 
Vector  operations  are  those  using  only  one -dimensional  arrays,  and  scalars. 
The  speedup  will  be  of  the  order  of  the  dimension  of  the  vector.   Some  of 
the  most  common  array  operations  are  listed  and  explained  below.   The 
common  notations  used  are  shown  in  Table  5. 

1.  Whole  Matrix  Fetch  and  Store.   With  a  big  memory  system 

o 
(m  >  n  ),  each  element  of  the  matrix  should  be  stored  in  a  separate  memory 

to  avoid  any  access  conflict.   For  smaller  memory  systems  (m  <  n  ),  we  have 
to  store  by  small  partitions  (square  blocks,  rows  or  columns),  and  subse- 
quent operations  have  to  be  done  in  the  partition  chosen. 

2.  Matrix  Transfer  (M  ->  M)  and  Matrix  Exchange  (M*-»M).   The 
copying  of  one  matrix  into  the  other  is  called  matrix  transfer.   The 
swapping  of  two  matrices  is  called  matrix  exchange.  Either  of  these  oper- 
ations can  be  done  in  the  alignment  network  without  entering  the  processing 
system,  if  the  alignment  network  has  some  internal  buffers  or  registers. 

A  typical  example  of  Matrix  Exchange  that  can  be  found  in  Fortran  programs 
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DO  10  I  =  1,  10 
DO  10  J  =  1,  10 
T  =  A(I,J) 
A(I,J)  =  B(I,J) 
B(I,J)  =  T 
10  CONTINUE 
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Arguments : 


S 
V 
M 
R 
C 
D 
0 


Scalar 

Vector 

Matrix  . 

Row  of  Matrix 

Column  of  Matrix 

Diagonal  of  Matrix 

One  Dimensional  Vector 


Operations 


Operator 

LS  Log sum 

CLS  Logsums  of  all  Columns  of  a  Matrix 

RLS  Logsums  of  all  Rows  of  a  Matrix 

MAX  Maximum  of  Array  Elements 

MIN  Minimum  of  Array  Elements 

■*  Transfer 

*—*■  Exchange 


Others 


m 
n 


Number  of  Memories  of  Processors 
Dimension  of  Arrays 


Table  5.   Notations  Used  in  Array  Operations 
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3.   Broadcasting  (S«M  -*  M,  V*M  -*■  M,  V*V  -*  M) .  Broadcasting  is 

the  routing  of  data  from  a  partition  of  the  memory  system  to  a  larger 

partition  of  the  processor  system,  each  datum  in  the  memory  partition 

being  simultaneously  required  by  several  processors.  An  example  of 

broadcasting  is  shown  in  the  following  Fortran  DO-Loop: 

DO  10  I  =  1,  10 
DO  10  J  =  1,  10 
10  A(I,J)  =  A(I,J)  *  B(J) 

This  is  a  row-wise  matrix  operation  and  will  be  denoted  as  (0*R  -*  R)  under 
the  column  V*M  -*  M.   If  the  matrix  A  is  stored  across  the  memory  system 
in  a  row  major  order,  after  a  direct  route  of  A  to  the  processors,  the 
necessary  routing  pattern  for  B  is  shown  in  Figure  5(a). 


1st  ROW 


2  nd  ROW 


nth  ROW 


Figure  5(a).   Duplication  Pattern 
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This  is  called  duplication  [15]»  However,  if  the  matrix  A  is 
stored  in  a  column  major  order,  a  different  routing  pattern  is  needed: 


1st  COLUMN 


2nd    COLUMN 


n  th   COLUMN 


Figure  5(b).   Fanout  Pattern 


This  is  called  fanout  [15]. 

One  common  variation  of  broadcasting  is  the  outer  product  of 
two  vectors  (V*V  -*  M)  as  in: 

DO  10  I  =  1,  10 

DO  10  J  =  1,  10 

10  A(I,J)  -  B(I)  *  C(J) 

Here,  if  we  perform  a  duplication  of  B  first,  followed  by  a 
fanout  of  C,  we  will  finally  get  the  matrix  A  in  a  column  major  order. 
On  the  other  hand,  if  fanout  of  B  is  done  before  a  duplication  of  C,  we 
will  get  the  matrix  A  in  row  major  order.   Information  about  the  types  of 
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the  two  vectors  is  essential  in  determining  whether  a  fanout  or  a  duplica- 
tion should  be  performed  first.  Another  variation  of  broadcasting  is 
the  routing  of  a  scalar  to  all  the  elements  of  the  matrix  (S*M  -*  M).  Either 
a  Crossbar  Switch  or  an  Omega  Network  can  perform  these  kinds  of  broadcastings. 
k.      Matrix  on  Matrix  Operations  (M*M  -»  M).  All  corresponding 

elements  of  the  two  matrices  are  operated  on  simultaneously  in  different 

2  P 

processors.   The  speedup  will  be  of  the  order  n  .  As  before,  if  m  >  n  , 

2 
two  whole  matrices  can  be  operated  on.   If  m  <  n  ,    some  matrix  parti- 
tioning has  to  be  used.  A  typical  example  of  this  operation  in  Fortran 
programs  is  shown  by  the  following  loop: 

DO  10  I  =  1,  10 
DO  10  J  =  1,  10 
A(I,J)  =  A(I,J)  +  B(I,J) 
10   CONTINUE 

5.   Matrix  Reduction  (R(M)  -*•  V  or  R(M)  -*  S).  Examples  of  the 

reduction  operators  are  the  sum  of  some  elements  (LS)  or  maximum  (or 

minimum)  of  some  elements  (MAX  or  MIN) .  Some  of  the  common  examples  of 

these  reduction  operators  are: 

CLS(M)  -*  0  The  column  sums  of  the  matrix  forming  a 

new  vector. 

LS(M)  ->  S  The  sum  of  all  matrix  elements  forming  a 

new  scalar. 

MAX(CLS(M)  -O)  ->  S   A  new  scalar  is  formed  by  taking  the  maximum 

of  all  the  column  sums  of  the  matrix. 

Summation  of  k  elements  can  be  computed  by  the  logsum  method  in 

k-1 
flogpkl  steps,  i.e.,  a  speedup  of  j? r-r   .   The  maximum  of  k  elements 
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can  also  be  found  by  Tlog  kl  comparisons.   The  speedup  will  also  be 

k-1 
■p: r-r  .   In  Fortran  programs,  a  CLS  operation  will  look  like  this: 

riog2ki  &      ' 

DO  10  J  =  1,  10 

S  =  0 

DO  20  I  =  1,  10 
20  S  =  S  +  A(I,  J) 

B(j)  =  B(J)/S 
10  CONTINUE 

6.  Vector  Fetch  and  Store.  As  discussed  in  Section  3 -A,  the 
access  time  of  vectors  depends  on  the  kind  of  access  schemes  available. 
When  m  =  n  and  n  is  even,  if  1-skew  scheme  is  used,  the  access  time  for 
rows  and  columns  will  be  1  time  cycle  while  the  access  time  of  a  diagonal 
is  2  cycles. 

7.  Vector  Transfer  (V  -*  V)  and  Exchange  (V«— »V).   These  opera- 
tions are  similar  to  those  for  the  matrices.   In  both  operations,  an 
appropriate  alignment  is  needed  to  conform  the  storage  pattern  of  the 
source  vector  into  that  of  the  destination  vector. 

8.  Broadcasting.   The  only  type  of  broadcasting  operation 
that  results  in  a  vector  is  the  broadcast  of  a  scalar  to  a  vector. 

9«  Vector  on  Vector  Operations  (V«V  -+  V).   These  operations 
should  further  be  distinguished  as  like -vector-operations,  where  the 
operands  have  the  same  vector  type,  and  mixed-vector-operations,  where 
the  operands  have  different  vector  types.  An  example  of  like-vector- 
operations  can  be  shown  by  the  following  notations: 

R.R  -  R 


14-2 


An  example  of  mixed-vector-operations  is  shown  as : 

R-C  ->  R 
Like -vector-operations  are  easier  to  handle  once  the  access  scheme  for 
the  machine  is  defined.   Mixed-vector-operations  are  much  trickier.  Which 
operand  should  be  realigned  into  the  other's  form  (and  hence  the  result's 
form)  is  still  an  open  question. 

10.  Reduction  of  Vectors  (R(v)  -*  S).   The  vector  reduction  oper- 
ations are  similar  to  those  for  the  matrices. 

C.   Results 

Each  EISPACK  program  is  studied  thoroughly  and  'decompiled' 
back  to  some  type  of  vector  notations.   Subscript  indices  are  back  substi- 
tuted to  the  DO  loop  limits  level  to  investigate  any  dependency  between 
array  elements.   Independent  array  elements  are  then  configured  as  array 
partitions  (square  partitions,  rows,  columns,  diagonals  or  one  dimen- 
sional vectors)  and  all  the  parallel  vector  operations  are  recorded. 

Table  6  keeps  counts  of  the  number  of  occurrences  of  each 
operation  type  in  each  of  the  ^>k   EISPACK  subroutines.   The  count  will  be 
a  static  count  which  shows  the  number  of  places  where  a  given  operation 
can  be  found  in  the  program.   This  will  be  different  from  the  dynamic 
count  which  shows  the  number  of  times  the  given  operation  is  used  when 
the  program  is  actually  run. 

The  notations  used  in  this  Table  are  explained  in  Table  5  while 
the  column  headings  are  explained  in  Section  3«B«  Whenever  a  vector  is 
used  in  an  operation,  the  vector  type  will  also  be  marked.   It  will  be  a 
C  (Column),  R  (Row),  D  (Diagonal)  or  0  (One  dimension  vector).  When  the 
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operation  is  a  reduction,  the  type  of  reduction  operation  is  indicated 
under  the  Remark  column. 

For  completeness  of  information  about  EISPACK,  the  last  two 
columns  are  added  to  include  the  number  of  occurrences  of  loops  with 
recurrence  relations,  and  of  loops  of  scalar  operations  which  can  actu- 
ally he  computed  in  parallel.   This  information,  although  not  essential  in 
a  cheap  vector  machine,  may  he  of  great  value  in  a  fancy  general  multi- 
processing system.  An  example  of  each  of  these  loops  is  shown  in  the 
Appendix. 

D.   Observations  and  Interpretations 

1.   Out  of  the  5^  programs  analyzed,  there  are  17  that  use  matrix 
operations.   Consider  the  most  common  case  in  which  the  order  of  the 
matrix  is  much  larger  than  the  number  of  processors  in  the  system  (i.e., 
n  »  m).   There  are  two  alternatives  in  partitioning  the  large  matrix. 
The  first  one  is  to  consider  the  matrix  as  a  group  of  one  dimensional 
vectors  (rows  or  columns).   The  second  alternative  is  to  partition  the 
matrix  into  submatrices  each  of  size  Vm  by  vm.   The  first  alternative 
allows  the  use  of  already  existing  vector  instructions.   So  no  new  instruc- 
tion is  needed,  thus  simplifying  the  whole  machine  design.   Furthermore, 
parallelism  detection  of  matrix  level  operations  from  ordinary  high  level 
language  (e.g.,  Fortran,  PL/l)  programs  is  much  more  difficult  than  that  of 
vector  level  operations.  However,  from  an  efficiency  point  of  view, 
Kuck  [16]  shows  that  Vm  by  vm  square  block  processing  of  large  matrices 
offers  an  efficiency  always  higher  than  that  of  row  or  column  processing. 
Moreover,  McKellar  [17]  points  out  that  storage  by  submatrices  induces  less 
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page  faults  than  storage  by  rows  when  only  a  portion  of  the  matrix  can  he 
put  in  the  memory  at  one  time. 

In  the  EISPACK  routines  analyzed,  there  are  155  matrix  operations 
compared  with  930  vector  operations.   If  we  separate  those  operations  into 
memory  operations  (fetch  and  store)  and  processor  operations,  we  will 
have  the  refined  table  of  occurrence  frequencies  shown  in  Table  7: 


— 

One 

Whole 

Dimen- 

Matrix 

Column 

Row 

Diagonal 

sional 

Total 

Memory 
Fetch 

50(11.9) 

121(28.8) 

168(^0.0) 

16(3.8) 

65(15.5) 

1^20  (100$) 

Memory 
Store 

27(11.8) 

72(31.6) 

78(3^.2) 

1M6.2) 

37(16.2) 

228(100$) 

Pro- 
cessor 

72(19-6) 

116(31.5) 

119(32.3) 

9(2.5) 

52(1*1.1) 

368(100$) 

Parenthesized  number  is  the  percentage  of  total  operations  of  a  given  type. 


Table  7«   Frequency  Table  of  Memory  and  Processor  Operations 


We  can  see  that  only  15$  of  the  operations  in  EISPACK  are  matrix 
operations.   This  is  expected  due  to  the  structures  of  the  algorithms.   It 
can  be  readily  realized  that  in  programs  such  as  matrix  multiplication 
subroutines  there  will  be  many  more  matrix  operations.   From  the  cost- 
effectiveness  standpoint,  probably  it  is  better  to  use  strictly  vector 
operations.   However,  to  fully  exploit  the  speed  of  a  machine,  it  may  be 
desirable  to  put  in  matrix  operations. 

2.   The  most  commonly  used  matrix  operation  type  is  broadcasting. 
This  implies  that  for  a  machine  to  handle  matrix  operations,  it  has  to  have 
an  alignment  network  that  can  handle  broadcasting.   In  21  out  of  the  26 
occurrences  of  V'M  -»  M,  the  matrix  being  used  behaves  as  a  group  of  columns, 


^9 


Hence  if  the  matrix  is  stored  in  a  column  major  order,  most  of  the  time  we 
■will  be  using  duplication.   On  the  other  hand,  if  the  matrix  is  stored  in 
row  major  order,  most  of  the  time  we  will  be  using  fanout  broadcasting. 
For  the  V-V  -»  M  type  of  operations,  both  fanout  and  duplication  are  needed 
for  each  operation. 

Both  Crossbar  Switch  and  Omega  Network  can  perform-  the  broad- 
casting connection. 

3.   Cumulative  statistics  of  the  occurrences  of  some  of  the  re- 
duction operations  in  EISPACK  routines  are  tabulated  in  Table  8. 


Reductii 
Operati  ( 

3n 
Dn 

No.  of 
Occurrences 

i  of 

Occurrences 

max(ls(m)  -> 

V)  -*  S 

2 

3.5 

LS(M)  ->  S 

2 

3.5 

CLS(M)  -»  0 

11 

18.9 

RLS(M)  -*  0 

3 

5.2 

LS(V)  ->  S 

37 

63. 7 

MIN(V)  ->  S 

2 

3.5 

MAX(V)  -  S 

1 

1.7 

Total       58  100.0$ 

Table  8.  Frequency  Table  of  Reduction  Operations 

It  can  be  readily  observed  that  summation  of  vector  or  matrix 
elements  is  very  common,  and  hence  some  fast  schemes  (like  the  log  summa- 
tion method)  to  compute  these  are  essential  for  an  efficient  parallel 
machine.   It  would  also  be  nice  if  some  fast  schemes  were  available  to 
compute  the  maximum  and  minimum  of  a  group  of  elements. 

An  important  fact  which  is  not  shown  in  Table  6  is  that  in  reduc- 
tion, as  well  as  in  other  array  operations,  some  elements  are  skipped  in 
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the  operation,  i.e.,  some  elements  are  not  supposed  to  be  processed.   The 
easiest  way  to  handle  this  is  to  set  up  a  control  vector  having  as  many 
bits  as  the  number  of  processors  in  the  system.  By  appropriately  setting 
up  the  control  vector  before  doing  the  operation,  only  desired  elements 
will  be  processed. 

h.     Matrix  exchange  is  not  found  in  EISPACK  programs.   There  are 
six  occurrences  of  matrix  transfers.   So  these  are  not  dominating  factors 
in  machine  design.  However,  there  are  30  vector  transfers  found  in  the 
EISPACK  routines,  and  also  32  vector  exchanges.   This  implies  a  signifi- 
cant part  of  the  computation  of  EISPACK  programs  is  in  the  memory-to- 
memory  transfer,  where  the  processors  are  not  needed  at  all.  Hence,  if 
the  alignment  network  is  powerful  enough  to  route  data  from  some  pre- 
determined memory  locations  to  some  other  predetermined  locations  without 
having  to  pass  through  processors,  a  lot  of  processor  time  could  be  saved. 

Since  most  routing  networks  go  one  way  or  the  other,  we  should 
actually  reconfigure  Figure  3  as  Figure  6. 

In  Figure  6,  the  solid  lines  are  essential  links  while  the  dotted 
lines  are  optional.   If  the  processor  system  is  to  be  bypassed  for  array 
transfers  or  exchanges,  we  will  need  the  additional  link  of  (A)  or  (b). 
This  arrangement  will  be  logical  if  the  processor  time  is  more  valuable 
than  the  total  cost  of  the  new  link  and  additional  multiplexors.   One 
network  delay  can  also  be  saved  with  this  kind  of  arrangement. 

5.   Cumulative  statistics  for  the  vector  operations  are  tabu- 
lated in  Table  9(a)  and  Table  9(b).   Rows  and  columns  are  the  most  dominant 
vector  types.  Among  the  three  types  of  partitioning  for  a  matrix  (row, 
column  and  diagonal),  diagonals  are  used  less  than  10$>  of  the  time. 
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Figure  6.   Reconfigured  Computer  System 
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D 


Total 


Vector  Fetch 

121(13.0)* 

168(18.1) 

16(1.7) 

65(7.0) 

370(39.8) 

Vector  Store 

72(7-8) 

78(8.4) 

14(1.5) 

37(4.0) 

201(21.7) 

s-v  -»  V 

59(6.4) 

58(6.3) 

9(1.0) 

16(1.7) 

142(15-4) 

v  ->  V 

4(0.4) 

3(0.3) 

- 

12(1.3) 

19(2.0) 

V  <-*  V 

iMi.5) 

18(1.9) 

- 

- 

32(3-4) 

v-v  ->  V 

43(^.6) 

39(4.2) 

- 

l(o.l) 

83(8.8) 

Total 


313(33.7)      364(39-2)     39(4.2)     I3l(l4.l) 


Table  9(a)«      Like-Vector-Operations 


Operations 


Frequency 

16(1.7) 
9(1.0) 
7(0.8) 
3(0.3) 
8(0.9) 


R«C  -»  V 
R-0  ->  V 

c-o  -»  V 

c  -  0 
D  ->  0 


Table  9(b).   Mixed-Vector-Operations 


The  first  number  in  each  entry  is  the  number  of  occurrences  and  the  paren- 
thesized number  is  the  percentage  of  the  total  number  of  vector  operations 
(930  in  total). 
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Referring  back  to  the  discussion  of  memory  accessing  in  section  3. A,  we_ 
can  see  that  as  long  as  diagonals  are  not  used  often,  it  is  more  cost 
effective  to  use  an  m  =  n  type  of  memory  system  with  a  1-skew  storage 
scheme.  Using  this  1-skew  scheme,  diagonals  can  be  accessed  in  two 
memory  cycles.  So  if  we  start  out  with  100  vector  operations  in  a  given 
EISPACK  computation,  less  than  ten  of  them  will  be  using  diagonals. 
Hence,  the  maximum  access  time  will  be  (90  +  10  x  2)  cycles  =  110  cycles. 

The  speed  will  be  off  only  by  less  than  10$.   The  simplicity  of 
the  design  and  the  nearly  full  memory  bandwidth  more  than  compensate  the 
slight  increase  in  access  time. 

Once  again  this  discussion  is  only  valid  for  an  EISPACK  machine.' 
Other  computations  might  use  diagonals  more  often.   In  that  case,  it  might 
pay  to  use  more  exotic  storage  schemes  to  cut  down  access  time  for  di- 
agonals . 

6.  Referring  back  to  Figure  6,  the  alignment  network  delay  can 
be  avoided  if  we  are  able  to  choose  the  direct  paths  (C)  and  (D).   These 
paths  can  be  used  in  most  S»V  -*  V  type  of  operations.   They  can  also  be 
used  in  like-vector-operations  where  the  vectors  operands  begin  in  the 
same  memory  unit.  However,  to  justify  any  of  these  paths,  more  information 
on  the  indexing  patterns  is  needed,  i.e.,  we  need  to  know  the  differences 
in  skew  and  skip  distances  of  the  vector  operands. 

By  the  same  token,  if  a  large  number  of  intermediate  results 
need  to  be  realigned  before  being  operated  on  again,  one  network  delay 
and  one  memory  cycle  time  will  be  saved  by  adding  direct  links  (E)  or  (F) 
between  the  alignment  networks  and  the  processors. 
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7«   In  operations  where  two  vector  operands  are  used  (i.e.,  the 
V*V  -*  V  type  of  operations),  83  of  them  are  like-vector-operations,  32  of 
them  are  mixed-vector-operations.   This  implies  that  nearly  30$>  of  the 
time  we  are  working  on  mixed-vectors  in  vector-on-vector  operations.   How- 
ever, this  is  the  area  where  only  a  little  knowledge  is  available.   Fur- 
ther understanding  about  the  mixed-vector  mode  of  operations  would  thus 
be  very  helpful. 
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k.      CONCLUSION 

This  thesis  attempts  to  show  how  program  analysis  can  lead  to 
some  parameters  concerning  the  design  of  a  better  multiprocessing  system 
suitable  to  do  a  given  set  of  computations.   In  this  thesis,  EISPACK 
programs  are  the  target  programs  and  the  system  in  mind  is  a  SUVED 
(Single-Instruction-Stream-Multiple-Data-Stream)  type  of  multiprocessing 
complex;  however,  similar  techniques  can  be  applied  to  most  of  the  other 
program  types  and  machine  types.   The  analysis  results  are  blended  with 
the  most  recent  ideas  on  multiprocessing  systems  to  show  how  the  results 
can  help  us  find  the  more  cost  effective  design. 

In  Section  2,  we  find  the  maximum  speedup  for  programs  and  the 
number  of  processors  required  to  achieve  such  speedup  assuming  we  use 
the  fastest  known  algorithms.  Work  is  made  easier  with  the  help  of  the 
Fortran  Analyzer  which  has  all  those  algorithms  embedded  in  it.   For 

EISPACK,  the  optimal  number  of  processors  appears  to  be  of  the  order  of 

2 
n  ,  where  n  is  the  average  dimension  of  the  matrices  used.   The  speedup 

obtained  for  10  x  10  matrices  is  around  20  which  is  above  average  for  a 

multiprocessing  system.   The  overall  efficiency  is  a  tolerable  30$. 

So  the  EISPACK  programs  appear  very  suitable  for  a  multiprocessing  machine. 

It  is  further  noticed  that  most  programs  tend  to  fall  into  three  different 

types  according  to  their  speedup  behavior.  A  factor  which  determines  to 

T 

which  type  a  program  belongs  is  the  value  of  a  where  a  =  5 — ^r-   .   The 

logT.. 

cutting  points  of  a  value  for  different  types  are  not  definite.   However, 
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1  and  10  seem  to  produce  consistent  results. 

In  Section  3>  we  attempt  to  use  operation  type  statistics  to 
infer  parts  of  machine  structure.   It  should  be  noted  that  uncovering 
array  operations  from  programs  written  for  a  serial  machine  is  non-trivial, 
but  not  impossible.   Hand  analyzed  results  are  used  here.   Deciding  what 
to  look  for  in  a  program  is  another  difficult  aspect  of  the  problem.   The 
information  retrieved  from  EISPACK  is  by  no  means  complete,  yet  some 
precious  parameters  of  machine  organization  are  revealed  by  the  analysis. 
Furthermore,  the  statistics  do  not  reflect  the  true  operation  counts,  for 
the  statistics  given  in  section  3  are  only  static  counts  of  the  number 
of  occurrences  of  the  operations  in  the  program.  A  dynamic  count  of  the 
operations  used  during  execution  would  be  more  precise.   However,  this 
would  imply  the  writing  of  another  analyzer  program  which  simulates  and 
counts  the  operations  in  the  user  program.   The  work  in  this  thesis  is  a 
preliminary  step  to  the  analyzer  program. 

In  Section  3;  we  can  see  that  it  may  be  more  cost-effective  to 
use  strictly  vector  operations.   However,  in  systems  where  full  speed 
is  required,  matrix  operations  are  desirable.   In  such  systems,  the 
broadcasting  operation  is  essential,  implying  that  a  crossbar  switch  or 
an  Omega  type  network  is  necessary  for  the  alignment  network.   In  general, 
some  fast  schemes  should  be  incorporated  into  the  machine  to  do  summation 
of  array  elements  since  a  large  number  of  such  summations  are  found  in 
the  programs  analyzed.   The  high  number  of  memory-to-memory  operations 
(68  in  the  EISPACK  programs)  also  justifies  the  establishment  of  additional 
links  to  bypass  the  processor  system  in  the  architecture.   Routes  bypassing 
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the  two  alignment  networks  can  be  considered  similarly  if  more  information 
on  the  indexing  patterns  of  vectors  used  in  the  operations  is  available.  By 
comparing  the  frequencies  of  usages  of  the  three  major  matrix  partitions, 
we  can  decide  on  the  storage  scheme  to  be  used.   Here,  in  EISPACK  programs, 
the  m  =  n  type  of  memory  system  with  a  1-skew  storage  scheme  seems  to  be 
most  cost-effective. 

One  by-product  of  the  analysis  is  the  evolution  of  some  desirable 
vector  instructions.  Array  summation  and  vector  exchange  are  two  good 
examples.   The  regular  usages  of  these  operations  make  it  feasible  to 
add  them  in  the  instruction  repertoires  of  most  vector  machines  and  most 
high-level  vector  languages. 
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APPENDIX 

1.    Two  examples  of  loops  with  recurrence  relations : 

A)   Loop  200  of  program  IMTQL1: 

DO  200  II  =  1,  MML 
I  =  M  -  II 
F  -  S*E(l) 
B  =  C*E(I) 
IF (DABS (F).LT. DABS (G))G0  TO  150 

c  =  g/f 

R  =   DSQRT(C*C+1.0D0) 

E(l+l)  =  F*R 

S  =  1.0D0/R 

C  =  C*S 

GO  TO  160 
150  S  =  F/G 

R  =  DSQRT(S*S+1.0D0) 

E(l+l)  =  G*R 

C  =  1.0D0/R 

S  =  S*C 
160  G  -  D(l+l)  -  P 

R  =  (D(l)-G)*S  +  2.0D0*C*B 

P  -  S*R 

D(l+1)  =  G+P 

G  =  C  *  R  -  B 
200  CONTINUE 


59 


B)   Loop  620  of  program  TINVTT: 

DO  620  II  =  P,Q 
I  =  P  +  Q  -  II 

RV6(l)  =  (RV6(l)-U*RV2(l)-V*RV3(l))/RVl(l) 

V  =  u 
U  =  RV6(l) 
620  CONTINUE 

2.   An  example  of  a  parallel  loop  of  scalar  operations  is  in  Loop  1^0 

of  program  HQR: 

DO  140  MM  =  L,ENM2 
M  =  ENM2  +  L  -  MM 
ZZ  =  H(M,M) 
R  =  X  -  ZZ 
S  =  Y  -  ZZ 

P  =  (R*S-W)/H(M+1,M)+H(M,M+1) 
Q  =  H(M+1,M+1)  -  ZZ  -  R  -  S 
R  =  H(M+2,M+l) 

S  =  DABS(P)  +  DABS(Q)  +  DABS(R) 
P  =  P/S 
Q  =  Q/S 
R  =  R/S 
IF  (M.EQ.L)  GO  TO  150 

IF  (dabs(h(m,m-i))*(dabs(q)+dabs(r)).le.machep 
X    *dabs(p)*(dabs(h(m-i,m-i))+dabs(zz)+dabs(h(m+i, 

X      M+l))))  GO  TO  150 
140  CONTINUE 
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any  set  of  meaningful  Fortran  programs. 
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